Abstract —Single image super-resolution (SR) aims to estimate
a high-resolution (HR) image from a low-resolution (LR) input.
Image priors are commonly learned to regularize the otherwise
seriously ill-posed SR problem, either using external LR-HR
pairs or internal similar patterns. We propose joint SR to
adaptively combine the advantages of both external and internal
SR methods. We define two loss functions, one using sparse coding
based on external examples and one using epitomic matching based on
internal examples, as well as a corresponding adaptive weight
that automatically balances their contributions according to their
reconstruction errors. Extensive SR results demonstrate the
effectiveness of the proposed method over existing state-of-the-art
methods, which is also verified by our subjective evaluation
study.

Index Terms —Super-resolution, example-based methods, 
sparse coding, epitome 

EDICS Category: TEC-ISR Interpolation, Super-Resolution, 
and Mosaicing 


I. Introduction

Super-resolution (SR) algorithms aim to construct a high-
resolution (HR) image from one or multiple low-resolution
(LR) input frames [1]. This problem is essentially ill-posed
because much information is lost in the HR-to-LR degradation
process. Thus SR has to rely on strong image priors, which
range from the simplest analytical smoothness assumptions
to more sophisticated statistical and structural priors learned
from natural images [?]. The most popular single
image SR methods rely on example-based learning techniques.
Classical example-based methods learn the mapping between
LR and HR image patches from a large and representative
external set of image pairs, and are thus denoted as external SR.
Meanwhile, images generally possess a great amount of
self-similarity; this property motivates a series
of internal SR methods. Despite much progress, it is
recognized that external and internal SR methods each suffer
from their own drawbacks. However, their complementary
properties inspire us to propose joint super-resolution
(joint SR), which adaptively utilizes both external and internal
examples for the SR task. The contributions of this paper are
multi-fold:

• We propose joint SR exploiting both external and internal
examples, by defining an adaptive combination of different
loss functions.

• We apply epitomic matching [?] to enforce self-similarity
in SR. Compared with the local nearest neighbor
(NN) matching adopted in [?], epitomic matching features
more robustness to outlier features, as well as the ability
to perform efficient non-local searching.

• We carry out a human subjective evaluation survey to
evaluate SR result quality based on visual perception,
among several state-of-the-art methods.

II. A Motivation Study of Joint SR
A. Related Work 

The joint utilization of both external and internal examples
has been most studied for image denoising [?]. Mosseri et
al. [?] first proposed that some image patches inherently
prefer internal examples for denoising, whereas other patches
inherently prefer external denoising. Such a preference is in
essence a tradeoff between noise-fitting and signal-fitting.
Burger et al. [?] proposed a learning-based approach that
automatically combines denoising results from an internal and an
external method. The learned combining strategy outperforms
both internal and external approaches across a wide range of
images, being closer to theoretical bounds.

In the SR literature, while the most popular methods are based
on either external or internal similarities, there have been
limited efforts to utilize one to regularize the other. The authors
in [9] incorporated both a local autoregressive (AR) model
and a nonlocal self-similarity regularization term into the
sparse representation framework, weighted by constant
coefficients. Yang et al. learned the (approximated) nonlinear
SR mapping function from a collection of external images
with the help of in-place self-similarity. More recently, an
explicitly joint model was put forward in [?], including two loss
functions by sparse coding and local scale invariance, bound
by an indicator function to decide which loss function will
work for each patch of the input image. Despite the existing
efforts, there is little understanding of how the external and
internal examples interact with each other in SR, how to judge
the external versus internal preference for each patch, and
how to make them collaborate towards an overall optimized
performance.

External SR methods use a universal set of example patches
to predict the missing (high-frequency) information for the HR
image. In [7], during the training phase, LR-HR patch pairs
are collected. Then in the test phase, each input LR patch
is matched to its nearest neighbor (NN) in the LR patch
pool, and the corresponding HR patch is selected as the output.
It is further formulated as a kernel ridge regression (KRR) in
[?]. More recently, a popular class of external SR methods
has been associated with the sparse coding technique [?]. The
patches of a natural image can be represented as a sparse
linear combination of elements within a redundant pre-trained
dictionary. Following this principle, the advanced coupled
sparse coding is further proposed in [?]. External SR
methods are known for their capabilities to produce plausible
image appearances. However, there is no guarantee that an
arbitrary input patch can be well matched or represented
by the external dataset of limited size. When dealing with
some unique features that rarely appear in the given dataset,
external SR methods are prone to produce either noise or
oversmoothness [?]. This constitutes the inherent problem of
any external SR method with a finite-size training set [?].

Fig. 1. Visual comparisons of both external and internal SR methods on different image local regions. The PSNR and SSIM values are also calculated and
reported. Panels and their values (PSNR in dB / SSIM):

Region             Groundtruth   External SR              Internal SR
Train, carriage    (a)           (b) 24.91 / 0.7915       (c) 24.13 / 0.8085
Train, brick       (d)           (e) 18.84 / 0.6576       (f) 19.78 / 0.7037
Kid, left eye      (g)           (h) 22.43 / 0.6286       (i) 22.18 / 0.5993
Kid, sweater       (j)           (k) 24.16 / 0.5444       (l) 24.45 / 0.6018

Internal SR methods search for example patches from the
input image itself, based on the fact that patches often tend
to recur within the image [?], or across different
image scales [?]. Although internal examples provide a limited
number of references, they are very relevant to the input image.
However, this type of approach has limited performance,
especially for irregular patches without any discernible
repeating pattern [?]. Due to the insufficient patch pairs, the
mismatches of internal examples often lead to more visual
artifacts. In addition, the epitome was proposed in [?]
to summarize both local and non-local similar patches and
reduce the artifacts caused by neighborhood matching. We
apply the epitome as an internal SR technique in this paper, and
demonstrate its advantages in our experiments.


B. Comparing External and Internal SR Methods 

Both external and internal SR methods have different
advantages and drawbacks. See Fig. 1 for a few specific examples.
The first two rows of images are cropped from the 3x SR
results of the Train image, and the last two rows from the
4x SR results of the Kid image. The images in each row are
cropped from the same spatial location of the groundtruth
image, the SR result by the external method [?], and the SR
result by the internal method [?], respectively. In the first row,
the top contour of the carriage in (c) contains noticeable structural
deformations, and the numbers “425” are more blurred than
those in (b). That is because the numbers can more easily find
counterparts or similar structure components in an external
dataset; but within the same image, there are few recurring
patterns that look visually similar to the numbers. Internal
examples generate sharper SR results in image (f) than in (e),
since the bricks repeat their own patterns frequently, and thus
the local neighborhood is rich in internal examples. Another
winning case of external examples is between (h) and (i), as
in the latter, inconsistent artifacts along the eyelid and around
the eyeball are obvious. Because the eye region is composed
of complex curves and fine structures, external examples
encompass more suitable reference patches and produce a
more natural-looking SR result. In contrast, the repeating sweater
textures lead to a sharper SR in (l) than in (k). The PSNR
and SSIM [?] results are also calculated for all regions, which further
validate our visual observations.

These comparisons display the generally different, even 
complementary behaviors of external and internal SR. Based 
on the observations, we expect that the external examples 
contribute to visually pleasant SR results for smooth regions as 
well as some irregular structures that barely recur in the input. 
Meanwhile, internal examples serve as a powerful source 
to reproduce unique and singular features that rarely appear 
externally but repeat in the input image (or its different scales). 
Note that similar arguments have been validated statistically
in the image denoising literature [?].


III. A Joint SR model 

Let $X$ denote the HR image to be estimated from the LR
input $Y$. $X_{ij}$ and $Y_{ij}$ stand for the $(i,j)$-th ($i,j = 1,2,\ldots$)
patch from $X$ and $Y$, respectively. Considering almost all SR
methods work on patches, we define two loss functions $\ell_G(\cdot)$
and $\ell_I(\cdot)$ in a patch-wise manner, which enforce the external
and internal similarities, respectively. While one intuitive idea
is to minimize a weighted combination of the two loss
functions, a patch-wise (adaptive) weight $\omega(\cdot)$ is needed to
balance them. We hereby write our proposed joint SR in the
general form:

$$\min_{X_{ij},\,\Theta_G,\,\Theta_I} \; \ell_G(X_{ij}, \Theta_G \mid Y_{ij}) + \omega(\Theta_G, \Theta_I \mid Y_{ij})\,\ell_I(X_{ij}, \Theta_I \mid Y_{ij}). \tag{1}$$

$\Theta_G$ and $\Theta_I$ are the latent representations of $X_{ij}$ over the
spaces of external and self examples, respectively. The form
$f(X_{ij}, \Theta \mid Y_{ij})$, $f$ being $\ell_G$, $\ell_I$ or $\omega$, represents a function
dependent on variables $X_{ij}$ and $\Theta$ ($\Theta_G$ or $\Theta_I$), with $Y_{ij}$
known (we omit $Y_{ij}$ in all formulations hereinafter). We will
discuss each component in (1) next.

One specific form of joint SR will be discussed in this
paper. However, note that with different choices of $\ell_G(\cdot)$, $\ell_I(\cdot)$,
and $\omega(\cdot)$, a variety of methods can be accommodated in the
framework. For example, if we set $\ell_G(\cdot)$ as the (adaptively
reweighted) sparse coding term, while choosing $\ell_I(\cdot)$
equivalent to the two local and non-local similarity based terms,
then (1) becomes the model proposed in [?], with $\omega(\cdot)$ being
some empirically chosen constants.


A. Sparse Coding for External Examples 

The HR and LR patch spaces $\{X_{ij}\}$ and $\{Y_{ij}\}$ are assumed
to be tied by some mapping function. With a well-trained
coupled dictionary pair $(\mathbf{D}_h, \mathbf{D}_l)$ (see [?] for details on
training a coupled dictionary pair), the coupled sparse coding
[?] assumes that $(X_{ij}, Y_{ij})$ tends to admit a common sparse
representation $a_{ij}$. Since $X$ is unknown, Yang et al. [?]
suggest first inferring the sparse code of $Y_{ij}$ with respect
to $\mathbf{D}_l$, and then using it as an approximation of the sparse
code of $X_{ij}$ with respect to $\mathbf{D}_h$, to recover $X_{ij} \approx \mathbf{D}_h a_{ij}$.
We set $\Theta_G = a_{ij}$ and constitute the loss function enforcing
external similarity:

$$\ell_G(X_{ij}, a_{ij}) = \lambda \|a_{ij}\|_1 + \|\mathbf{D}_l a_{ij} - Y_{ij}\|_F^2 + \|\mathbf{D}_h a_{ij} - X_{ij}\|_F^2. \tag{2}$$
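As a minimal sketch of how the external loss could be evaluated for one patch, the snippet below computes a sparsity penalty plus the LR and HR reconstruction errors for a given sparse code. The function names, the list-of-rows dictionary layout and the toy numbers are illustrative assumptions, not part of the original method description.

```python
def matvec(D, a):
    """Multiply a dictionary D (given as a list of rows) by a code vector a."""
    return [sum(d_k * a_k for d_k, a_k in zip(row, a)) for row in D]


def sq_dist(u, v):
    """Squared Euclidean distance between two flattened patches."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def external_loss(a, D_l, D_h, y, x, lam=1.0):
    """lam*||a||_1 + ||D_l a - y||^2 + ||D_h a - x||^2 for one patch.

    a: sparse code; D_l, D_h: LR/HR dictionaries (hypothetical toy layout);
    y, x: flattened LR and HR patches.
    """
    sparsity = lam * sum(abs(a_k) for a_k in a)
    lr_fit = sq_dist(matvec(D_l, a), y)
    hr_fit = sq_dist(matvec(D_h, a), x)
    return sparsity + lr_fit + hr_fit


# Toy usage with a 2-atom dictionary pair (values chosen only for illustration).
D_l = [[1.0, 0.0], [0.0, 1.0]]
D_h = [[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]]
loss = external_loss([1.0, -2.0], D_l, D_h, y=[1.0, -2.0], x=[-1.0, 3.0, -3.0])
```

In this toy case the LR patch is reproduced exactly, so the loss is made up of the sparsity term and the HR residual only.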


B. Epitomic Matching for Internal Examples 

1) The High Frequency Transfer Scheme: Based on the
observation that singular features like edges and corners in
small patches tend to repeat almost identically across different
image scales, Freedman and Fattal [?] applied the “high
frequency transfer” method to search for the high-frequency
component of a target HR patch, by NN patch matching
across scales. Defining a linear interpolation operator $\mathcal{U}$ and
a downsampling operator $\mathcal{D}$, for the input LR image $Y$, we
first obtain its initial upsampled image $X'^E = \mathcal{U}(Y)$, and a
smoothed input image $Y' = \mathcal{D}(\mathcal{U}(Y))$. Given the smoothed
patch $X'^E_{ij}$, the missing high-frequency band of each unknown
patch $X^E_{ij}$ is predicted by first solving an NN matching problem

$$(m,n) = \arg\min_{(m,n) \in W_{ij}} \|Y'_{mn} - X'^E_{ij}\|_F^2, \tag{3}$$

where $W_{ij}$ is defined as a small local searching window on
image $Y'$. We could also simply express it as $(m,n) = f_{NN}(X'^E_{ij}, Y')$.
With the co-located patch $Y_{mn}$ from $Y$, the
high-frequency band $Y_{mn} - Y'_{mn}$ is pasted onto $X'^E_{ij}$, i.e.,
$X^E_{ij} = X'^E_{ij} + Y_{mn} - Y'_{mn}$.
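The high-frequency transfer step can be sketched as follows: search a small window of the smoothed image for the patch closest to the smoothed HR patch, then paste the difference between the original and smoothed LR patches at the matched location. This is only an illustration under stated assumptions: images are plain nested lists, `center` is assumed to be the window centre already mapped into the coordinates of the smoothed image, and all function names are hypothetical.

```python
def get_patch(img, r, c, size):
    """Return the size x size patch whose top-left corner is (r, c)."""
    return [row[c:c + size] for row in img[r:r + size]]


def patch_dist(p, q):
    """Sum of squared differences between two equally sized patches."""
    return sum((a - b) ** 2 for pr, qr in zip(p, q) for a, b in zip(pr, qr))


def nn_match(x_smooth, y_smooth, center, radius, size):
    """Best top-left corner (m, n) inside a local window of the smoothed image."""
    rows, cols = len(y_smooth), len(y_smooth[0])
    r0, c0 = center
    best, best_err = None, float("inf")
    for m in range(max(0, r0 - radius), min(rows - size, r0 + radius) + 1):
        for n in range(max(0, c0 - radius), min(cols - size, c0 + radius) + 1):
            err = patch_dist(x_smooth, get_patch(y_smooth, m, n, size))
            if err < best_err:
                best, best_err = (m, n), err
    return best, best_err


def transfer_high_freq(x_smooth, y, y_smooth, match, size):
    """Add the band (Y_mn - Y'_mn) at the matched location to the smoothed patch."""
    m, n = match
    y_p = get_patch(y, m, n, size)
    ys_p = get_patch(y_smooth, m, n, size)
    return [[xs + a - b for xs, a, b in zip(xr, yr, ysr)]
            for xr, yr, ysr in zip(x_smooth, y_p, ys_p)]


# Toy usage: a 4x4 smoothed image plus a checkerboard high-frequency band.
y_smooth = [[float(4 * r + c) for c in range(4)] for r in range(4)]
y = [[y_smooth[r][c] + (0.5 if (r + c) % 2 == 0 else -0.5) for c in range(4)]
     for r in range(4)]
x_smooth = get_patch(y_smooth, 1, 2, 2)
match, err = nn_match(x_smooth, y_smooth, center=(1, 1), radius=1, size=2)
x_e = transfer_high_freq(x_smooth, y, y_smooth, match, size=2)
```

The exhaustive window scan is chosen for clarity; the matched error returned alongside the location is the same quantity that later serves as a matching error for the patch.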
2) EPI: Epitomic Matching for Internal SR: The matching
of $X'^E_{ij}$ over the smoothed input image $Y'$ is the core
step of the high frequency transfer scheme. However, the
performance of NN matching [?] is degraded in the presence
of noise and outliers. Moreover, the NN matching in [?] is
restricted to a local window for efficiency, which potentially
accounts for some rigid artifacts.

Instead, we propose epitomic matching to replace NN
matching in the above frequency transfer scheme. As a
generative model, the epitome [?] summarizes a large set of
raw image patches into a condensed representation in a way
similar to Gaussian Mixture Models. We first learn an epitome
$e_{Y'}$ from $Y'$, and then match each $X'^E_{ij}$ over $e_{Y'}$ rather than
$Y'$ directly. Assume $(m,n) = f_{ept}(X'^E_{ij}, e_{Y'})$, where $f_{ept}$
denotes the procedure of epitomic matching by $e_{Y'}$. It then
follows the same way as in [?]: $X^E_{ij} = X'^E_{ij} + Y_{mn} - Y'_{mn}$;
the only difference here is the replacement of $f_{NN}$ with $f_{ept}$.
The high-frequency transfer scheme equipped with epitomic
matching can thus be applied to SR by itself as well, named
EPI for short, which will be included in our experiments in
Section IV and compared to the method using NN matching in
[?].

Since $e_{Y'}$ summarizes the patches of the entire $Y'$, the
proposed epitomic matching benefits from non-local patch
matching. In the absence of self-similar patches in the local
neighborhood, epitomic matching weights refer to non-local
matches, thereby effectively reducing the artifacts arising from
local matching [?] in a restricted small neighborhood. In
addition, note that each epitome patch summarizes a batch
of similar raw patches in $Y'$. Any patch $Y'_{ij}$ that contains
certain noise or outliers in $Y'$ has a small posterior and thus
tends not to be selected as a candidate match for $X'^E_{ij}$, improving
the robustness of matching. The algorithm details of epitomic
matching are included in the Appendix.

Moreover, we can also incorporate nearest neighbor (NN)
matching into our epitomic matching, leading to an enhanced
patch matching scheme that features both non-local (by
epitome) and local (by NN) matching. Suppose the high frequency
components obtained by epitomic matching and NN matching
for patch $X'^E_{ij}$ are $H_{e,ij}$ and $H_{NN,ij}$, respectively; we use a
weighted average of the two as the final high frequency
component $H_{ij}$:

$$H_{ij} = w\,H_{e,ij} + (1 - w)\,H_{NN,ij}, \tag{4}$$

where the weight $w = p(T^* \mid X'^E_{ij}, e)$ denotes the probability
of the most probable hidden mapping $T^*$ given the patch $X'^E_{ij}$.
A higher $w$ indicates that the patch $X'^E_{ij}$ is more likely
to have a reliable match by epitomic matching (with the
probability measured through the corresponding most probable
hidden mapping), so a larger weight is associated with
epitomic matching, and vice versa. This is the practical
implementation of EPI that we used in the paper.
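The weighted average of the two high-frequency components is a per-pixel convex combination. A minimal sketch, assuming both components are available as equally sized nested lists and that the probability `w` has already been computed by the epitomic matching step (the function name is hypothetical):

```python
def blend_high_freq(h_epitome, h_nn, w):
    """Per-pixel blend w * H_e + (1 - w) * H_NN of two high-frequency patches."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w is a probability and must lie in [0, 1]")
    return [[w * e + (1.0 - w) * n for e, n in zip(row_e, row_n)]
            for row_e, row_n in zip(h_epitome, h_nn)]


# Toy usage: a low-confidence epitomic match leans towards the NN component.
blended = blend_high_freq([[2.0, 4.0]], [[0.0, 0.0]], w=0.25)
```

With `w = 1` the result reduces to the epitomic component alone, and with `w = 0` to the NN component alone.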

Finally, we let $\Theta_I = X^E_{ij}$ and define

$$\ell_I(X_{ij}, X^E_{ij}) = \|X_{ij} - X^E_{ij}\|_F^2, \tag{5}$$

where $X^E_{ij}$ is the internal SR result by epitomic matching.


C. Learning the Adaptive Weights 

In [?], Mosseri et al. showed that the internal versus
external preference is tightly related to the Signal-to-Noise-
Ratio (SNR) estimate of each patch. Inspired by that finding,
we seek similar definitions of “noise” in SR based on
the latent representation errors. The external noise is defined
by the residual of sparse coding

$$N_G(a_{ij}) = \|\mathbf{D}_l a_{ij} - Y_{ij}\|_F^2. \tag{6}$$

Meanwhile, the internal noise finds its counterpart definition
in the epitomic matching error within $Y'$:

$$N_I(X^E_{ij}) = \|Y'_{mn} - X'^E_{ij}\|_F^2, \tag{7}$$

where $Y'_{mn}$ is the matching patch in $Y'$ for $X'^E_{ij}$.

Usually, the two “noises” are on the same magnitude level,
which aligns with the fact that external and internal examples
have similar performances on many patches (such as homogeneous
regions). However, there do exist patches where the two
differ significantly in performance, as shown in Fig. 1,
which means the patch has a strong preference toward
one of them. In such cases, the “preferred” term needs to
be sufficiently emphasized. We thus construct the following
patch-wise adaptive weight ($p$ is the hyperparameter):

$$\omega(a_{ij}, X^E_{ij}) = \exp\big(p \cdot [N_G(a_{ij}) - N_I(X^E_{ij})]\big). \tag{8}$$

When the internal noise becomes larger, the weight decays
quickly to ensure that external similarity dominates, and vice
versa.
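The behaviour of the adaptive weight can be illustrated with a short sketch that computes the two "noise" terms and the exponential weight for one patch. The helper names and the toy inputs are assumptions made for illustration; only the functional form, an exponential of the scaled difference between external and internal noise, is taken from the text.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two flattened patches."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def external_noise(D_l, a, y):
    """Sparse coding residual ||D_l a - y||^2 for one LR patch."""
    recon = [sum(d * a_k for d, a_k in zip(row, a)) for row in D_l]
    return sq_dist(recon, y)


def internal_noise(y_smooth_match, x_smooth):
    """Matching error between the matched smoothed patch and the smoothed HR patch."""
    return sq_dist(y_smooth_match, x_smooth)


def adaptive_weight(n_g, n_i, p=1.0):
    """omega = exp(p * (N_G - N_I)); it multiplies the internal loss term."""
    return math.exp(p * (n_g - n_i))


# Toy usage: a poor internal match (large N_I) pushes the weight towards 0.
w_internal_bad = adaptive_weight(n_g=0.1, n_i=2.0)
w_external_bad = adaptive_weight(n_g=2.0, n_i=0.1)
```

Equal noise levels give a weight of exactly 1, so neither source is preferred; larger `p` makes the weight react more sharply to the same noise gap.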


D. Optimization 

Directly solving (1) is very complex due to its high
nonlinearity and the entanglement among all variables. Instead,
we follow the coordinate descent fashion and solve the
following three sub-problems iteratively.

1) $a_{ij}$-subproblem: Fixing $X_{ij}$ and $X^E_{ij}$, we have the
following minimization w.r.t. $a_{ij}$:

$$\min_{a_{ij}} \; \lambda\|a_{ij}\|_1 + \|\mathbf{D}_l a_{ij} - Y_{ij}\|_F^2 + \|\mathbf{D}_h a_{ij} - X_{ij}\|_F^2 + \big[\ell_I(X_{ij}, X^E_{ij}) \cdot \exp(-p \cdot N_I(X^E_{ij}))\big] \cdot \exp(p \cdot N_G(a_{ij})). \tag{9}$$

The major bottleneck of exactly solving (9) lies in the last
exponential term. We let $a^0_{ij}$ denote the value of $a_{ij}$ solved in
the last iteration. We then apply a first-order Taylor expansion
to the last term of the objective in (9), with regard to $N_G(a_{ij})$
at $a_{ij} = a^0_{ij}$, and solve the approximated problem as follows:

$$\min_{a_{ij}} \; \lambda\|a_{ij}\|_1 + (1 + C)\,\|\mathbf{D}_l a_{ij} - Y_{ij}\|_F^2 + \|\mathbf{D}_h a_{ij} - X_{ij}\|_F^2, \tag{10}$$

where $C$ is the constant coefficient:

$$C = \big[\ell_I(X_{ij}, X^E_{ij}) \cdot \exp(-p \cdot N_I(X^E_{ij}))\big] \cdot \big[p \cdot \exp(p \cdot N_G(a^0_{ij}))\big] = p\,\ell_I(X_{ij}, X^E_{ij}) \cdot \omega(a^0_{ij}, X^E_{ij}). \tag{11}$$

(10) can be conveniently solved by the feature sign algorithm.
Note that (10) is a valid approximation of (9) since $a_{ij}$ and $a^0_{ij}$
become quite close after a few iterations, so that the higher-
order Taylor expansions can be reasonably ignored.
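For readers who want the linearization step spelled out, the following is a short worked derivation of the coefficient $C$. It uses only the quantities named in this section; the helper function $g$ is introduced here purely for illustration.

```latex
\begin{align*}
g(N_G) &:= \big[\ell_I(X_{ij}, X^E_{ij})\,\exp(-p\,N_I(X^E_{ij}))\big]\,\exp(p\,N_G), \\
g\big(N_G(a_{ij})\big) &\approx g\big(N_G(a^0_{ij})\big)
   + g'\big(N_G(a^0_{ij})\big)\,\big[N_G(a_{ij}) - N_G(a^0_{ij})\big], \\
g'\big(N_G(a^0_{ij})\big) &= p\,\ell_I(X_{ij}, X^E_{ij})\,
   \exp\!\big(p\,[N_G(a^0_{ij}) - N_I(X^E_{ij})]\big)
   = p\,\ell_I(X_{ij}, X^E_{ij})\,\omega(a^0_{ij}, X^E_{ij}) = C.
\end{align*}
```

Dropping the terms that do not depend on $a_{ij}$ leaves $C \cdot N_G(a_{ij}) = C\,\|\mathbf{D}_l a_{ij} - Y_{ij}\|_F^2$, which merges with the existing LR fidelity term to give the factor $(1 + C)$.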
Another noticeable fact is that since $C > 0$, the second term
is always emphasized more than the third term, which makes
sense as $Y_{ij}$ is the “accurate” LR image, while $X_{ij}$ is just an
estimate of the HR image and is thus less weighted. Further
considering the formulation (11), $C$ grows as $\omega(a^0_{ij}, X^E_{ij})$
becomes larger. That implies that when external SR becomes the major
source of “SR noise” on a patch in the last iteration, (10) will
accordingly rely less on the last solved $X_{ij}$.
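The reweighted sparse coding problem, an l1 penalty plus a (1 + C)-weighted LR fidelity term and an HR fidelity term, can be rewritten as a single l1-regularized least-squares problem by stacking the two dictionaries. The sketch below solves it with plain ISTA (iterative soft-thresholding), which is swapped in here for simplicity and is not the feature sign algorithm named in the text; the function names, step-size bound and toy data are assumptions.

```python
import math


def soft_threshold(v, t):
    """Shrink v towards zero by t (proximal operator of t * |v|)."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def solve_a_subproblem(D_l, D_h, y, x, C, lam=1.0, n_iter=500):
    """Minimise lam*||a||_1 + (1+C)*||D_l a - y||^2 + ||D_h a - x||^2 by ISTA."""
    s = math.sqrt(1.0 + C)
    # Stack both quadratic terms into one least-squares term ||A a - b||^2.
    A = [[s * v for v in row] for row in D_l] + [list(row) for row in D_h]
    b = [s * v for v in y] + list(x)
    k = len(A[0])
    # 2*||A||_F^2 upper-bounds the Lipschitz constant of the smooth part's gradient.
    L = 2.0 * sum(v * v for row in A for v in row)
    a = [0.0] * k
    for _ in range(n_iter):
        resid = [sum(r * a_j for r, a_j in zip(row, a)) - b_i
                 for row, b_i in zip(A, b)]
        grad = [2.0 * sum(A[i][j] * resid[i] for i in range(len(A)))
                for j in range(k)]
        a = [soft_threshold(a_j - g_j / L, lam / L) for a_j, g_j in zip(a, grad)]
    return a


# Toy usage with identity dictionaries: per coordinate the objective is
# |a| + 3*(a - 1)^2 for the first entry and |a| + 3*a^2 for the second.
identity = [[1.0, 0.0], [0.0, 1.0]]
code = solve_a_subproblem(identity, identity, y=[1.0, 0.0], x=[1.0, 0.0], C=1.0)
```

In the toy case the first coefficient settles at 5/6 (where 1 + 6(a - 1) = 0) and the second stays at zero, which shows how the l1 term shrinks the solution away from the least-squares fit.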


2) $X^E_{ij}$-subproblem: Fixing $a_{ij}$ and $X_{ij}$, the
subproblem becomes

$$\min_{X^E_{ij}} \; \exp\big(-p \cdot \|Y'_{mn} - X'^E_{ij}\|_F^2\big)\,\ell_I(X_{ij}, X^E_{ij}). \tag{12}$$

While in Section III.B.2 $X^E_{ij}$ is directly computed from
the input LR image, the objective in (12) is dependent on not
only $X^E_{ij}$ but also $X_{ij}$, and is not necessarily minimized
by the best match $X^E_{ij}$ obtained from solving $f_{ept}$. In our
implementation, the $K$ best candidates ($K = 5$) that yield
minimum matching errors of solving $f_{ept}$ are first obtained.
Among all those candidates, we further select the one that
minimizes the loss value as defined in (12). By this discrete
search-type algorithm, $X^E_{ij}$ becomes a latent variable to be
updated together with $X_{ij}$ per iteration, and is better suited for
the global optimization than the simplistic solution by solving
$f_{ept}$.
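The discrete search over candidates can be sketched as a two-stage selection: keep the K candidates with the smallest matching error, then pick the one that minimizes the weighted internal loss. The candidate representation (a pair of matching error and flattened internal estimate) and the function names are assumptions made for this illustration.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two flattened patches."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def select_internal_candidate(candidates, x_current, p=1.0, k=5):
    """Pick the internal estimate for one patch.

    candidates: list of (match_error, x_e) pairs, where match_error is the
    matching error of the candidate and x_e the resulting internal estimate.
    Stage 1 keeps the k smallest matching errors; stage 2 minimises
    exp(-p * match_error) * ||x_current - x_e||^2 over that shortlist.
    """
    shortlist = sorted(candidates, key=lambda c: c[0])[:k]

    def objective(candidate):
        match_error, x_e = candidate
        return math.exp(-p * match_error) * sq_dist(x_current, x_e)

    return min(shortlist, key=objective)


# Toy usage: the second-best match agrees with the current HR estimate.
cands = [(0.1, [0.0, 0.0]), (0.2, [1.0, 1.0]), (5.0, [1.0, 1.0])]
chosen = select_internal_candidate(cands, x_current=[1.0, 1.0], k=2)
```

With `k = 1` the procedure falls back to the plain best match, which is the behaviour the text describes as the simplistic solution.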


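3) $X_{ij}$-subproblem: With both $a_{ij}$ and $X^E_{ij}$ fixed, the
solution of $X_{ij}$ simply follows a weighted least squares (WLS)
problem:

$$\min_{X_{ij}} \; \|\mathbf{D}_h a_{ij} - X_{ij}\|_F^2 + \omega(a_{ij}, X^E_{ij})\,\|X_{ij} - X^E_{ij}\|_F^2, \tag{13}$$

with an explicit solution:

$$X_{ij} = \frac{\mathbf{D}_h a_{ij} + \omega(a_{ij}, X^E_{ij}) \cdot X^E_{ij}}{1 + \omega(a_{ij}, X^E_{ij})}. \tag{14}$$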
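Assuming the weighted least squares objective has the form implied by the external and internal losses, namely the squared distance to the sparse-coding reconstruction plus the weight times the squared distance to the internal estimate, setting its gradient to zero gives a per-pixel weighted average. The sketch below implements that closed form and an objective function for checking it; the names are hypothetical.

```python
def wls_objective(x, hr_from_code, x_internal, omega):
    """||hr_from_code - x||^2 + omega * ||x - x_internal||^2 for flattened patches."""
    fit_external = sum((d - xi) ** 2 for d, xi in zip(hr_from_code, x))
    fit_internal = sum((xi - e) ** 2 for xi, e in zip(x, x_internal))
    return fit_external + omega * fit_internal


def update_hr_patch(hr_from_code, x_internal, omega):
    """Closed-form minimiser: (hr_from_code + omega * x_internal) / (1 + omega)."""
    if omega < 0:
        raise ValueError("omega must be non-negative")
    return [(d + omega * e) / (1.0 + omega)
            for d, e in zip(hr_from_code, x_internal)]


# Toy usage: hr_from_code plays the role of the dictionary reconstruction.
d, e, omega = [2.0, 4.0], [0.0, 0.0], 3.0
x_new = update_hr_patch(d, e, omega)
```

A weight of zero returns the sparse-coding reconstruction unchanged, while a very large weight pulls the patch towards the internal estimate.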


IV. Experiments 


A. Implementation Details 

We itemize the parameter and implementation settings for 
the following group of experiments: 

• We use 5x5 patches with one pixel overlapping for all
experiments except those on UHD images in Section IV.D,
where the patch size is 25 x 25 with five pixel overlapping.

• In (2), we adopt the $\mathbf{D}_l$ and $\mathbf{D}_h$ trained in the same
way as in [?], due to the similar roles played by the
dictionaries in their formulation and our $\ell_G$ function.
However, we are aware that such $\mathbf{D}_l$ and $\mathbf{D}_h$ are not
optimized for the proposed method, and will integrate
a specifically designed dictionary learning part in future
work. $\lambda$ is empirically set as 1.

• In ([ 5 ]), the size of the epitome is | of the image size. 

• In (11), we set $p = 1$ for all experiments. We also
observed in experiments that a larger $p$ will usually lead
to a faster decrease in objective value, but the SR result
quality may degrade a bit.

• We initialize $a_{ij}$ by solving coupled sparse coding in [?].
$X_{ij}$ is initialized by bicubic interpolation.


• We set the maximum iteration number to be 10 for
the coordinate descent algorithm. For UHD cases, the
maximum iteration number is adjusted to be 5.

• For color images, we apply SR algorithms to the
luminance channel only, as humans are more sensitive to
luminance changes. We then interpolate the color layers
(Cb, Cr) using plain bicubic interpolation.

B. Comparison with State-of-the-Art Results 

We compare the proposed method with the following
selection of competitive methods:

• Bi-Cubic Interpolation (“BCI” for short and similarly
hereinafter), as a comparison baseline.

• Coupled Sparse Coding (CSC) [?], as the classical
external-example-based SR.

• Local Self-Example based SR (LSE) [?], as the classical
internal-example-based SR.

• Epitome-based SR (EPI). We compare EPI to LSE to
demonstrate the advantage of epitomic matching over the
local NN matching.

• SR based on In-place Example Regression (IER) [?],
as the previous SR utilizing both external and internal
information.

• The proposed joint SR (JSR).

We list the SR results (best viewed on a high-resolution
display) for two test images, Temple and Train, with an
amplifying factor of 3. PSNR and SSIM measurements, as well as
zoomed local regions (using nearest neighbor interpolation),
are available for the different methods as well.

In Fig. 2, although greatly outperforming the naive BCI,
the external-example based CSC tends to lose many fine
details. In contrast, LSE brings out an overly sharp SR result
with observable blockiness. EPI produces a more visually
pleasing result, through searching for the matches over the
entire input efficiently by the pre-trained epitome rather than
a local neighborhood. Therefore, EPI substantially reduces
the artifacts compared to LSE. But without any external
information available, it is still incapable of inferring enough
high-frequency details from the input alone, especially under
a large amplifying factor. The result of IER greatly improves
but is still accompanied by occasional small artifacts. Finally,
JSR provides a clear recovery of the steps, and it reconstructs
the most pillar textures. In Fig. 3, JSR is the only algorithm
which clearly recovers the numbers on the carriage and the bricks
on the bridge simultaneously. The performance superiority of
JSR is also verified by the PSNR comparisons, where larger
margins are obtained by JSR over others in both cases.

Next, we move on to the more challenging 4x SR case,
using the Chip image which is quite abundant in edges and
textures. Since we have no ground truth for the Chip image of
4x size, only visual comparisons are presented. Given such
a large SR factor, the CSC result is a bit blurry around the
characters on the surface of the chip. Both LSE and EPI create
jaggy artifacts along the long edge of the chip, as well as small
structure distortions. The IER result causes fewer artifacts but at
the cost of detail sharpness. The JSR result presents the best
SR with few artifacts.






Fig. 2. 3x SR results of the Temple image: (a) BCI (PSNR = 25.29 dB, SSIM = 0.8762); (b) CSC (PSNR = 26.20 dB, SSIM = 0.8924); (c) LSE (PSNR = 21.17 dB, SSIM = 0.7954); (d) EPI (PSNR = 24.34 dB, SSIM = 0.8901); (e) IER (PSNR = 25.54 dB, SSIM = 0.8937); (f) JSR (PSNR = 27.87 dB, SSIM = 0.9327); (g) Groundtruth; (h) LR input.
Fig. 3. 3x SR results of the Train image: (a) BCI (PSNR = 26.14 dB, SSIM = 0.9403); (b) CSC (PSNR = 26.58 dB, SSIM = 0.9506); (c) LSE (PSNR = 22.54 dB, SSIM = 0.8850); (d) EPI (PSNR = 26.22 dB, SSIM = 0.9487); (e) IER (PSNR = 24.80 dB, SSIM = 0.9323); (f) JSR (PSNR = 28.02 dB, SSIM = 0.9796); (g) Groundtruth; (h) LR input.
TABLE II

The PSNR values (dB) with various fixed global weights (PSNR
= 24.1734 dB with an adaptive weight)

Fixed weight   ω = 0.1   ω = 1   ω = 3   ω = 5   ω = 10
PSNR (dB)      23.13     23.23   23.32   22.66   21.22

The key idea of JSR is utilizing the complementary behavior
of both external and internal SR methods. Note that when one
inverse problem is better solved, it also makes a better
parameter estimate for solving the other. JSR is not a simple static
weighted average of external SR (CSC) and internal SR (EPI).
When optimized jointly, the external and internal subproblems
can “boost” each other (through auxiliary variables), and each
performs better than when applied independently. That is why
JSR recovers details that exist in neither the internal nor the
external SR result.

To further verify the superiority of JSR numerically, we
compare the average PSNR and SSIM results of a few
recently-proposed, state-of-the-art single image SR methods,
including CSC, LSE, the Adjusted Anchored Neighborhood
Regression (A+) [22], and the latest Super-Resolution
Convolutional Neural Network (SRCNN) [21]. Table I reports the
results on the widely-adopted Set 5 and Set 14 datasets, in
terms of both PSNR and SSIM. First, it is not a surprise to
us that JSR does not always yield higher PSNR than SRCNN
and the other methods, as the epitomic matching component is not
meant to be optimized under the Mean-Square-Error (MSE) measure, in
contrast to the end-to-end MSE-driven regression adopted in
SRCNN. However, it is notable that JSR is particularly
favored by SSIM over other methods, owing to the
self-similar examples that convey input-specific structural details.
Considering SSIM measures image quality more consistently
with human perception, the observation is in accordance with
our human subject evaluation results (see Section IV.E).

C. Effect of Adaptive Weight 

To demonstrate how the proposed joint SR benefits from
the learned adaptive weight (8), we compare 4x SR results
of the Kid image, between joint SR solving (1) and its counterpart
with fixed global weights, i.e., with the weight $\omega$ set as a constant for
all patches. Table II shows that the joint SR with an adaptive
weight gains a consistent PSNR advantage over the SR with
a large range of fixed weights.

More interestingly, we visualize the patch-wise weight maps
of the joint SR results in Figs. 2-4 using heat maps, as in
Fig. 5. The $(i,j)$-th pixel in the weight map denotes the final
weight of $X_{ij}$ when the joint SR reaches a stable solution.
All weights are normalized to [0,1], by the form of a
sigmoid function, $1/(1 + \omega(a_{ij}, X^E_{ij}))$, for visualization purposes. A
larger pixel value in the weight maps denotes a smaller weight
and thus a higher emphasis on external examples, and vice
versa. For the Temple image, Fig. 5 (a) clearly manifests that
self examples dominate the SR of the temple building that
is full of texture patterns. Most regions of Fig. 5 (b) are
close to 0.5, which means that $\omega(a_{ij}, X^E_{ij})$ is close to 1 and
external and internal examples have similar performances on
most patches. However, internal similarity makes more
significant contributions in reconstructing the brick regions, while
external examples work remarkably better on the irregular
contours of forests. Finally, the Chip image is an example
where external examples have advantages on the majority of
patches. Considering self examples prove to create artifacts
here (see Fig. 4 (c) (d)), they are avoided in joint SR by the
adaptive weights.
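The weight-map normalization can be illustrated with a small sketch. It assumes the sigmoid-type mapping 1 / (1 + ω), which is an assumption chosen because it matches the two properties stated in the text: a weight of 1 maps to 0.5, and a smaller weight gives a larger pixel value. The function names are hypothetical.

```python
def weight_map_value(omega):
    """Map a non-negative weight to (0, 1]; omega = 1 gives exactly 0.5."""
    if omega < 0:
        raise ValueError("omega must be non-negative")
    return 1.0 / (1.0 + omega)


def weight_map(omegas):
    """Apply the normalisation to a 2-D grid of patch-wise weights."""
    return [[weight_map_value(w) for w in row] for row in omegas]


# Toy usage: small weights (external examples preferred) map to bright pixels.
heat = weight_map([[0.0, 1.0], [3.0, 9.0]])
```

Because the patch-wise weight is itself an exponential, this mapping is a logistic function of the difference between the two noise terms.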

D. SR Beyond Standard Definition: From HD Image to UHD 
Image 

In almost all SR literature, experiments are conducted 
with Standard-Definition (SD) images (720 x 480 or 720 
X 576 pixels) or smaller. The High-Definition (HD) formats: 
720p (1280 X 720 pixels) and 1080p (1920 x 1080 pixels) 
have become popular today. Moreover, Ultra High-Definition 
(UHD) TVs are hitting the consumer markets right now with 
the 3840 x 2160 resolution. It is thus quite interesting to 
explore whether SR algorithms established on SD images can 
be applied or adjusted for HD or UHD cases. In this section, 
we upscale HD images of 1280 x 720 pixels to UHD results 
of 3840 X 2160 pixels, using competitor methods and our 
joint SR algorithm. 

Since most HD and UHD images typically contain much 
more diverse textures and a richer collection of fine structures 
than SD images, we enlarge the patch size from 5x5 to 25x25 
(the dictionary pair is therefore re-trained as well) to capture 
more variations, meanwhile increasing the overlapping from 
one pixel to five pixels to ensure enough spatial consistency. 
Hereby JSR is compared with its two “component” algorithms,
i.e., CSC and EPI. We choose several challenging UHD images
(3840 x 2160 pixels) with very cluttered texture regions,
downsampling them to HD size (1280 x 720 pixels), on which
we apply the SR algorithm with a factor of 3. In all cases, our
results are consistently sharper and clearer. The SR results
(zoomed local regions) of the Leopard image are displayed in
Fig. 6 as an example, with the PSNR and SSIM measurements
of the full-size results.

E. Subjective Evaluation 

We conduct an online subjective evaluation survey on
the quality of SR results produced by all the different methods
in Section IV.B. Ground truth HR images are also included
when they are available as references. Each participant of the
survey is shown a set of HR image pairs obtained using two
different methods for the same LR image. For each pair, the
participant needs to decide which one is better than the other
in terms of perceptual quality. The image pairs are drawn
from all the competitive methods randomly, and the images
winning the pairwise comparison will be compared again in
the next round, until the best one is selected. We have a total of
101 participants giving 1,047 pairwise comparisons, over six
images which are commonly used as benchmark images in
SR, with different scaling factors (Kid x4, Chip x4, Statue x4,
Leopard x3, Temple x3 and Train x3). We fit a Bradley-Terry
^ http://www.ifp.illinois.edu/~wang308/survey 










Fig. 4. 4x SR results of the Chip image. Recoverable panel labels: (c) LSE; (f) JSR.
TABLE I

Average PSNR (dB) and SSIM performance comparisons on the Set 5 and Set 14 datasets

Dataset        Metric   Bicubic   Sparse Coding [?]   Freedman et al.   A+ [22]   SRCNN [21]   JSR
Set 5 (2x)     PSNR     33.66     35.27               33.61             36.24     36.66        36.71
               SSIM     0.9299    0.9540              0.9375            0.9544    0.9542       0.9573
Set 5 (3x)     PSNR     30.39     31.42               30.77             32.59     32.75        32.54
               SSIM     0.8682    0.8821              0.8774            0.9088    0.9090       0.9186
Set 14 (2x)    PSNR     30.23     31.34               31.99             32.58     32.45        32.54
               SSIM     0.8687    0.8928              0.8921            0.9056    0.9067       0.9082
Set 14 (3x)    PSNR     27.54     28.31               28.26             29.13     29.60        29.49
               SSIM     0.7736    0.7954              0.8043            0.8188    0.8215       0.8242



Fig. 5. The weight maps of (a) Temple image; (b) Train image; (c) Chip image. 



Fig. 7. Subjective SR quality scores for different methods. The ground truth 
has score 1. 

[29] model to estimate the subjective scores for each method so 
that they can be ranked. More experiment details are included 
in the Appendix. Fig. 7 shows the estimated scores for 
the six SR methods in our evaluation. As expected, all SR 
methods receive much lower scores than the ground truth 
(set as score 1), which shows how challenging the SR problem 
itself is. Also, bicubic interpolation is significantly worse 
than the others. The proposed JSR method outperforms all other 
state-of-the-art methods by a large margin, which indicates that 
JSR produces HR images that human viewers find more visually 
favorable. 



V. Conclusion 

This paper presents a joint single image SR model that 
learns from both external and internal examples. We define 
the two loss functions by sparse coding and epitomic matching, 
respectively, and construct an adaptive weight to balance the 
two terms. Experimental results demonstrate that joint SR 
outperforms existing state-of-the-art methods for various test 
images of different definitions and scaling factors, and is also 
significantly more favored by user perception. In future work, we 
will integrate dictionary learning into the proposed scheme and 
reduce its complexity. 

Appendix 

1. Epitomic Matching Algorithm 

We assume an epitome $e$ of size $M_e \times N_e$, for an input 
image of size $M \times N$, where $M_e < M$ and $N_e < N$. Similarly 
to GMMs, $e$ contains three parameters: $\mu$, the 
Gaussian mean of size $M_e \times N_e$; $\phi$, the Gaussian variance 
of size $M_e \times N_e$; and $\pi$, the mixture coefficients. Suppose 
$Q$ densely sampled, overlapped patches are taken from the input image, 
i.e. $\{Z_k\}_{k=1}^{Q}$. Each $Z_k$ contains pixels with image coordinates 
$S_k$, and is associated with a hidden mapping $T_k$ from $S_k$ 
to the epitome coordinates. All the $Q$ patches are generated 
independently from the epitome and the corresponding hidden 
mappings as below: 

$$p(\{Z_k\}_{k=1}^{Q} \mid \{T_k\}_{k=1}^{Q}, e) = \prod_{k=1}^{Q} p(Z_k \mid T_k, e) \qquad (15)$$ 

The probability $p(Z_k \mid T_k, e)$ in (15) is computed by the 
Gaussian distribution where the Gaussian component is specified 
by the hidden mapping $T_k$. $T_k$ behaves similarly to the 
hidden variable in traditional GMMs. 

Fig. 8 illustrates the role that the hidden mapping plays 
in the epitome, as well as the graphical model of the 
epitome. With all the above notations, our goal is to find 
the epitome $\hat{e}$ that maximizes the log likelihood function 

Fig. 6. 3x SR results of the Leopard image (local region displayed). 
(a) Full SHD image; 
(b) local region from SR result by BCI (PSNR = 24.14 dB, SSIM = 0.9701); 
(c) local region from SR result by CSC (PSNR = 25.32 dB, SSIM = 0.9618); 
(d) local region from SR result by EPI (PSNR = 23.58 dB, SSIM = 0.9656); 
(e) local region from SR result by JSR (PSNR = 25.82 dB, SSIM = 0.9746). 

Fig. 8. (a) The hidden mapping $T_k$ maps the image patch $Z_k$ to its 
corresponding patch of the same size in $e$, and $Z_k$ can be mapped to any 
possible epitome patch in accordance with $T_k$. (b) The epitome graphical 
model. 


$\hat{e} = \arg\max_{e} \log p(\{Z_k\}_{k=1}^{Q} \mid e)$, which can be solved by the 
Expectation-Maximization (EM) algorithm. 

The Expectation step of the EM algorithm, which computes 
the posterior of all the hidden mappings, accounts for the 
most time-consuming part of the learning process. Since 
the posteriors of the hidden mappings for all the patches are 
independent of each other, they can be computed in parallel. 
Therefore, the learning process can be significantly accelerated 
by parallel computing. 
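
To illustrate why the Expectation step parallelizes, the following minimal Python sketch computes the posterior over hidden mappings for a single patch. It is not the implementation used in this paper. The function name `mapping_posterior`, the identification of each mapping with a top-left epitome coordinate, the wrap-around at the epitome borders, and the uniform mixture coefficients used when `log_pi` is omitted are all assumptions made for the example.

```python
import math

def mapping_posterior(patch, mu, phi, log_pi=None):
    """Posterior over hidden mappings for one patch (illustrative E-step).

    patch:   P x P list of pixel values.
    mu, phi: Me x Ne lists holding the epitome mean and variance.
    log_pi:  optional Me x Ne list of log mixture coefficients
             (assumed uniform when omitted).
    A mapping is identified with the top-left epitome coordinate (r, c);
    the epitome is assumed to wrap around at its borders.
    """
    Me, Ne, P = len(mu), len(mu[0]), len(patch)
    log_post = {}
    for r in range(Me):
        for c in range(Ne):
            ll = 0.0 if log_pi is None else log_pi[r][c]
            for i in range(P):
                for j in range(P):
                    m = mu[(r + i) % Me][(c + j) % Ne]
                    v = phi[(r + i) % Me][(c + j) % Ne]
                    d = patch[i][j] - m
                    # Log density of an independent per-pixel Gaussian.
                    ll += -0.5 * (math.log(2 * math.pi * v) + d * d / v)
            log_post[(r, c)] = ll
    # Normalize with log-sum-exp for numerical stability.
    top = max(log_post.values())
    z = top + math.log(sum(math.exp(v - top) for v in log_post.values()))
    return {t: math.exp(v - z) for t, v in log_post.items()}
```

Each call depends only on its own patch and the shared epitome parameters, so calls for different patches could be distributed, for example with `concurrent.futures.ProcessPoolExecutor.map`.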

With the epitome $e_{Y'}$ learned from the smoothed input 
image $Y'$, the location of the matching patch in the epitome 
$e_{Y'}$ for each patch $X'_{ij}$ is specified by the most probable 
hidden mapping for $X'_{ij}$: 

$$T^{*}_{ij} = \arg\max_{T_{ij}} p(T_{ij} \mid X'_{ij}, e) \qquad (16)$$ 

The top $K$ patches in $Y'$ with large posterior probabilities 
$p(T^{*}_{ij} \mid \cdot, e)$ are regarded as the candidate matches for the patch 
$X'_{ij}$, and the match $Y'_{ij}$ is the one among these $K$ candidate 
patches which has the minimum Sum of Squared Distance (SSD) 
to $X'_{ij}$. Note that the indices of the $K$ candidate patches in 
$Y'$ for each epitome patch are pre-computed and stored when 
training the epitome $e_{Y'}$ from the smoothed input image $Y'$, 
which makes epitomic matching efficient. 
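
The matching procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `posterior` is taken to be a dictionary from hidden mappings to posterior probabilities, `candidate_index` is a hypothetical pre-computed table from each epitome location to the indices of its K candidate patches, and `patches` is a hypothetical list of the patches of the smoothed image. None of these names come from the paper.

```python
def ssd(a, b):
    """Sum of squared differences between two equally sized patches."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def epitomic_match(query, posterior, candidate_index, patches):
    """Pick the match for `query` among the stored candidates.

    1. Take the most probable hidden mapping under `posterior`.
    2. Look up the pre-computed candidate patch indices for that
       epitome location.
    3. Return the candidate with the minimum SSD to the query patch.
    """
    t_star = max(posterior, key=posterior.get)
    cand_ids = candidate_index[t_star]
    best = min(cand_ids, key=lambda k: ssd(query, patches[k]))
    return t_star, best
```

Because the candidate table is built once during training, the cost per query is one posterior lookup plus K SSD evaluations rather than a search over the whole image.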

EPI significantly reduces artifacts and produces more 
visually pleasing SR results through dynamic weighting, 
compared to the local NN matching method. 

2. Subjective Review Experiment 

The methods under comparison include BIC, CSC, LSE, 
IER, EPI, JSR. Ground truth HR images are also included 
as references when they are available. Each human 
subject participating in the evaluation is shown a set of HR 
image pairs obtained using two different methods for the same 
LR image. For each pair, the subject needs to decide which 
one is better than the other in terms of perceptual quality. 
The image pairs are drawn randomly from all the competing 
methods, and the images winning the pairwise comparison 
will be compared again in the next round until the best one is 
selected. 
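
One way to picture this protocol is the following Python sketch, which runs a single knockout session and records every pairwise outcome. The exact pairing scheme of the survey is not specified here, so the random ordering, the function name `run_knockout`, the `prefer` callback standing in for the participant's choice, and the nested-dictionary `wins` table are all assumptions made for illustration.

```python
import random

def run_knockout(methods, prefer, wins, rng):
    """Run one knockout session and record every pairwise outcome.

    methods: list of method names shown for one LR image.
    prefer:  callable (a, b) -> the preferred method of the pair.
    wins:    wins[a][b] counts how often a was favored over b.
    rng:     random.Random instance used to draw the presentation order.
    """
    order = list(methods)
    rng.shuffle(order)
    champion = order[0]
    for challenger in order[1:]:
        winner = prefer(champion, challenger)
        loser = challenger if winner == champion else champion
        wins[winner][loser] += 1
        champion = winner  # the winner is compared again in the next round
    return champion
```

With seven entries (six SR methods plus the ground truth), one full session yields six comparisons for an image; a session that is abandoned part-way still contributes the outcomes recorded so far.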

We have a total of 101 participants giving 1,047 pairwise 
comparisons over six images with different scaling 
factors (“Kid”×4, “Chip”×4, “Statue”×4, “Leopard”×3, 
“Temple”×3 and “Train”×3). Not every participant completed 
all the comparisons, but their partial responses are still useful. 
All the evaluation results can be summarized into a 7×7 
winning matrix W for the seven methods (including ground truth), 
based on which we fit a Bradley-Terry [29] model to estimate 
the subjective score for each method so that they can be 
ranked. In the Bradley-Terry model, the probability that an 
object X is favored over Y is assumed to be 

$$p(X \succ Y) = \frac{e^{s_X}}{e^{s_X} + e^{s_Y}} = \frac{1}{1 + e^{s_Y - s_X}}, \qquad (17)$$ 

where $s_X$ and $s_Y$ are the subjective scores for X and Y. 
The scores $s$ for all the objects can be jointly estimated by 
maximizing the log likelihood of the pairwise comparison 
observations: 

$$\max_{s} \sum_{i,j} w_{ij} \log \left( \frac{1}{1 + e^{s_j - s_i}} \right), \qquad (18)$$ 

where $w_{ij}$ is the $(i,j)$-th element in the winning matrix W, 
representing the number of times method $i$ is favored 
over method $j$. We use the Newton-Raphson method to solve 
Eq. (18) and set the score for the ground truth as 1 to 
avoid the scale issue. 
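
The estimation step can be sketched as follows. This is a minimal pure-Python illustration of Newton-Raphson on the Bradley-Terry log likelihood, not the code used for the reported scores. The names `fit_bradley_terry` and `solve`, the `ref` argument selecting the item whose score is held fixed, and the iteration limits are assumptions.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_bradley_terry(W, ref, ref_score=1.0, iters=50, tol=1e-10):
    """Maximize sum_ij W[i][j] * log(1 / (1 + exp(s_j - s_i))).

    W[i][j] is the number of times item i was favored over item j.
    The score of item `ref` is held at ref_score, which removes the
    ambiguity caused by the model depending only on score differences.
    """
    n = len(W)
    free = [i for i in range(n) if i != ref]
    s = [ref_score] * n
    for _ in range(iters):
        grad = [0.0] * n
        hess = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                p = 1.0 / (1.0 + math.exp(s[j] - s[i]))  # P(i favored over j)
                nij = W[i][j] + W[j][i]                  # comparisons of i and j
                grad[i] += W[i][j] - nij * p
                hess[i][i] -= nij * p * (1.0 - p)
                hess[i][j] += nij * p * (1.0 - p)
        g = [grad[i] for i in free]
        H = [[hess[i][j] for j in free] for i in free]
        step = solve(H, g)  # Newton step: s <- s - H^{-1} g
        for k, i in enumerate(free):
            s[i] -= step[k]
        if max(abs(d) for d in step) < tol:
            break
    return s
```

The sketch assumes every item has at least one win and one loss against the rest, so that the maximum exists and the reduced Hessian is invertible; a method that never wins or never loses would push its score towards infinity.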

Fig. 7 shows the estimated scores for the six SR methods in 
our evaluation. As expected, all the SR methods have much 
lower scores than the ground truth, showing the great challenge 
of the SR problem. Also, bicubic interpolation is significantly 
worse than the other SR methods. The proposed JSR method 
outperforms the previous state-of-the-art methods by a large 
margin, which indicates that JSR can produce visually more 
pleasant HR images than other approaches. 